06_analysis_2

Author

group22

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load data

df <- read_tsv("../data/03_dat_aug.tsv")

The second script of analysis will be related to the performing PCA and the creation of some models with different variations. These variations are based on the creation of three different models. The first of them will use the components used to reach 80% variability after the PCA analysis combined with the use of logistic regression for prediction. The second and third ones will be different GLM for men and women. Thanks to this we will be able to analyze whether gender has a special importance when predicting patients with diabetes.

PCA

It has to be mentioned that we have used the code provided by the lecturers (PCA tidyverse) and apply it using our data to achieve some conclusions. That is why the code will be the achieved conclusions will be completely different. Thus, that is why we want to highlight that we have been able to understand how Principal Component Analysis (PCA) works thanks to this code and how to interpret the obtained results. Lets start with PCA.

PCA is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. PCA can be used for many purposes, however, we have opted to apply this technique to create models with less variables to try to gain more meaningful conclusions of them. Thus, we are using it to reduce the dimensionality to simplify our data set with many variables, to make it more manageable for analysis.

Having said that, first thing that has to be done is to work only with numerical variables. PCA is typically applied to continuous variables, and it assumes that the variables are on a numerical scale. PCA is based on the covariance matrix, which involves computing variances and covariances between numeric variables. Therefore, directly applying PCA to a dataset with categorical variables is not appropriate However, there are techniques that extend PCA to handle categorical variables. One such technique is called Multiple Correspondence Analysis (MCA), which is an extension of PCA for categorical data. MCA is suitable for datasets where variables are categorical and can take more than two levels. As we are not working with categorical data, we will forget about this idea.

column_classes <- df |>
  map_chr(class)

numeric_columns <- names(which(column_classes == "numeric")) |>
  print()
 [1] "Diabetes_binary"      "HighBP"               "HighChol"            
 [4] "CholCheck"            "BMI"                  "Smoker"              
 [7] "Stroke"               "HeartDiseaseorAttack" "PhysActivity"        
[10] "Fruits"               "Veggies"              "HvyAlcoholConsump"   
[13] "AnyHealthcare"        "NoDocbcCost"          "GenHlth"             
[16] "MentHlth"             "PhysHlth"             "DiffWalk"            
[19] "Sex"                  "Age"                  "Education"           
[22] "Income"               "Education_binary"    

After checking which are the numerical variables of our data set, we will only select those ones to perform the analysis. Before using PCA, we have to make sure that all our data is playing on a level field. Scaling means adjusting the values so they’re all in a similar range.

pca_fit <- df |> 
  select_if(where(is.numeric)) |>
  prcomp(scale = TRUE)

Now, we will create a special kind of map for our data using PCA. To do this, we blend the PCA results with the original data, adding back the information we temporarily set aside. It’s like bringing back the colors to our points based on categories that were there in the first place but were temporarily taken away for PCA. We use a tool called augment() from the “broom” package to make this happen. It needs the model we created and the original data as inputs.

df_augmented <- augment(pca_fit, df)

ggplot(df_augmented, aes(.fittedPC1, .fittedPC2, color = Diabetes_Status)) + 
  geom_point(size = 3, alpha = 0.7, shape = 16) +
  scale_color_manual( values = c("Non-Diabetic" = "#FF5733", "Diabetic" = "#33FF57")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "lightgray"))

It can be noticed that the difference among having diabetes or not is being quite well distinguished in the two first components. We will continue by creating the rotation matrix. In PCA this matrix is like a set of instructions that turns our data points into a new set of directions. It helps to highlight the most important information in the data. We need to think of it as finding the best view tu understand patterns in the data.

pca_fit |>
  tidy(matrix = "rotation")
# A tibble: 529 × 3
   column             PC   value
   <chr>           <dbl>   <dbl>
 1 Diabetes_binary     1  0.282 
 2 Diabetes_binary     2 -0.250 
 3 Diabetes_binary     3  0.0348
 4 Diabetes_binary     4  0.0538
 5 Diabetes_binary     5 -0.192 
 6 Diabetes_binary     6 -0.229 
 7 Diabetes_binary     7  0.0411
 8 Diabetes_binary     8 -0.0303
 9 Diabetes_binary     9 -0.0329
10 Diabetes_binary    10  0.0256
# ℹ 519 more rows

Finally we will look at the variance explained by each PC. To do this, we will compute the eigenvalues. But, what is an eigenvalue representing? An eigenvalue in PCA represents the amount of variance captured by a principal component. It tells us how much information that component holds: larger eigenvalues mean more important components.

pca_fit |>
  tidy(matrix = "eigenvalues")
# A tibble: 23 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   2.00   0.175       0.175
 2     2   1.36   0.0807      0.255
 3     3   1.22   0.0646      0.320
 4     4   1.15   0.0571      0.377
 5     5   1.10   0.0526      0.430
 6     6   1.06   0.0488      0.478
 7     7   1.02   0.0451      0.523
 8     8   0.989  0.0426      0.566
 9     9   0.970  0.0409      0.607
10    10   0.939  0.0383      0.645
# ℹ 13 more rows

Here we can see those results in a plot.

pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_bar(fill = "#66c2a5", alpha = 0.7, stat = "identity") +
  scale_x_continuous(breaks = 1:22) +
  theme_minimal()

Having reached this part of the project, we now want to only select a certain amount of components to be used in our models. Selecting the number of components in PCA depends on our goals and the amount of data we are willing to retain. A common approach is to choose enough components to capture a significant portion of the total variance, often aiming for around 70-80%. This is of course a trade-off. The more components we have, the more information we will have. However, more noise will be introduced.

To help us finding the desired number of components, we will create a scree plot and select those whose cumulative sum achieves an 80% of cumulative variance.

cumulative_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)

plot_data <- data.frame(
  PC = seq_along(cumulative_var),
  Cumulative_Proportion = cumulative_var
)

plot_4 <- ggplot(plot_data, aes(x = PC, y = Cumulative_Proportion)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Number of Principal Components",
    y = "Cumulative Proportion of Variance Explained"
  ) +
  geom_hline(yintercept = 0.7, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 0.75, linetype = "dashed", color = "blue") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "green")

plot_4

Looking at the results of the plot, we will select a total of 15 components.

Prediction model

After performing PCA and selecting the appropiate number of components, we will build a simple logistic regression model for prediction tasks. In our case, the target variable will be Diabetes_binay and the following procedure will be made:

  1. Dimensionality reduction: We will select 15 principal components from the results of a our PCA analysis (pca_fit).

  2. Data splitting: The dataset is divided into training and testing sets (70% training, 30% testing) using random indices.

  3. Model training: A logistic regression model (glm) is trained on the training data, where the outcome variable is binary (Diabetes_binary).

  4. Prediction and evaluation: The model is used to predict outcomes on the test data. Predictions are converted to binary (0 or 1) based on a threshold of 0.5. To evaluate the model’s performance, a confusion matrix is created and accuracy is computed as well.

num_components <- 15

selected_components <- as.data.frame(pca_fit$x[, 1:num_components])

set.seed(123) 
train_indices <- sample(seq_len(nrow(df)), 0.7 * nrow(df))

train_data <- selected_components |>
  slice(train_indices)

train_outcome <- df$Diabetes_binary[train_indices]

test_data <- selected_components |>
  slice(-train_indices)

test_outcome <- df$Diabetes_binary[-train_indices]

model <- glm(train_outcome ~ ., family = "binomial", data = cbind(train_outcome, train_data))

predictions <- predict(model, newdata = as.data.frame(test_data), type = "response")

binary_predictions <- ifelse(predictions > 0.5, 1, 0)

conf_matrix <- table(Predicted = binary_predictions, Actual = test_outcome)

accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix |>
  print()
         Actual
Predicted    0    1
        0 9179 1316
        1 1430 9283
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8705206 

One of the methods used to check the model’s performance is a confusion matrix. It is a table that helps evaluate the performance of a classification model. It compares the predicted values to the actual values in a dataset and consists of four components. Those components are then used to compute some other metrics such as accuracy (the one that we have printed). The results obtained by applying the selected 15 components are quite good as we are achieving a value of 87% in accuracy.

Men VS. Women (GLM)

The last part of the analysis will be focused in detecting whether a GLM works better for men or women. We want to help tackling gender bias in science. This means the data might not treat men and women fairly. This could happen if there aren’t enough women included in studies, or if the way data is collected or analyzed favors one gender over the other. Fixing this means making sure studies include a good mix of men and women and treating everyone equally in how data is handled.

First of all we will take a look to the distribution between men and women in our data set.

df |> 
  count(Sex)
# A tibble: 2 × 2
    Sex     n
  <dbl> <int>
1     0 38386
2     1 32306

Value 0 means women and value 1 means men. So according to our data, we have slightly more women individuals rather than men. Lets generate two different datasets according to gender and apply logistic regression to see their performances.

women_data <- df |>
  filter(Sex == 0) |>
  select_if(is.numeric)

men_data <- df |>
  filter(Sex == 1) |>
  select_if(is.numeric)

Lets use first women data in the model.

model_women <- glm(data = women_data,
                Diabetes_binary ~ .,
                family = binomial)

tidy_summary_women <- tidy(model_women)
glance_summary_women <- glance(model_women)
full_summary_women <- bind_cols(tidy_summary_women, glance_summary_women)
print(full_summary_women)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc…  -6.52     0.177      -36.9  2.69e-297        53151.   38385 -18999.
 2 HighBP     0.809    0.0275      29.4  1.03e-189        53151.   38385 -18999.
 3 HighChol   0.664    0.0262      25.4  6.67e-142        53151.   38385 -18999.
 4 CholChe…   1.36     0.119       11.4  2.56e- 30        53151.   38385 -18999.
 5 BMI        0.0760   0.00203     37.4  8.31e-306        53151.   38385 -18999.
 6 Smoker    -0.0355   0.0262      -1.36 1.75e-  1        53151.   38385 -18999.
 7 Stroke     0.192    0.0562       3.42 6.26e-  4        53151.   38385 -18999.
 8 HeartDi…   0.245    0.0435       5.64 1.73e-  8        53151.   38385 -18999.
 9 PhysAct…  -0.0333   0.0289      -1.15 2.49e-  1        53151.   38385 -18999.
10 Fruits     0.0332   0.0278       1.20 2.32e-  1        53151.   38385 -18999.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

Now lets create a model for men data.

model_men <- glm(data = men_data,
                Diabetes_binary ~ .,
                family = binomial)

tidy_summary_men <- tidy(model_men)
glance_summary_men <- glance(model_men)
full_summary_men <- bind_cols(tidy_summary_men, glance_summary_men)
print(full_summary_men)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc…  -6.83     0.177     -38.5   0                44710.   32305 -17109.
 2 HighBP     0.676    0.0285     23.7   8.04e-124        44710.   32305 -17109.
 3 HighChol   0.516    0.0274     18.8   3.98e- 79        44710.   32305 -17109.
 4 CholChe…   1.33     0.111      12.0   4.45e- 33        44710.   32305 -17109.
 5 BMI        0.0722   0.00250    28.8   6.66e-183        44710.   32305 -17109.
 6 Smoker     0.0232   0.0276      0.840 4.01e-  1        44710.   32305 -17109.
 7 Stroke     0.124    0.0598      2.08  3.78e-  2        44710.   32305 -17109.
 8 HeartDi…   0.241    0.0377      6.40  1.52e- 10        44710.   32305 -17109.
 9 PhysAct…  -0.0357   0.0316     -1.13  2.60e-  1        44710.   32305 -17109.
10 Fruits    -0.105    0.0277     -3.77  1.61e-  4        44710.   32305 -17109.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

The results show how the model containing only male patients achieves a lower AIC than the model with only female patients. It is true that the difference is not very big. It can be seen how the male model gives more importance to variables related to physical health as well as the consumption of fruits instead of vegetables. The latter may make sense since there is a greater amount of sugar in fruits than in vegetables.

Something that should be highlighted is that it has a significant improvement with the GLM built in the first part of the analysis. It manages to lower the value of the AIC a lot. This is something that we did not have in mind that could happen and that is why it can be considered as a task to analyze in the future.

Saving Plots

plot_4 |> 
  ggsave(filename = "06_key_plot_4.png", path = "../results/")